ARCHITECTURES FOR DISCRETE WAVELET TRANSFORMS



BACKGROUND OF THE INVENTION 

The invention relates to architectures for implementing discrete wavelet
transforms (DWTs). The invention relates to any field in which DWTs may be used,
and is particularly, but not exclusively, concerned with architectures used in the
fields of digital signal and image processing, data compression, multimedia, and
communications.

A list of documents is given at the end of this description. These documents are 
referred to in the following by their corresponding numeral in square brackets. 

The Discrete Wavelet Transform (DWT) [1]-[4] is a mathematical technique that
decomposes an input signal of length $N = r \times k^m$ in the time domain by using
dilated/contracted and translated versions of a single basis function, named the
prototype wavelet. In one particular case $N = 2^m$. DWTs can be performed using
Haar wavelets, Hadamard wavelets and wavelet packets. Decomposition by Haar
wavelets involves low-pass and high-pass filtering followed by downsampling by
two of both resultant bands and repeated decomposition of the low-frequency
band to J levels or octaves.

In the last decade, the DWT has often been found preferable to other traditional
signal processing techniques since it offers useful features such as inherent
scalability, computational complexity of $O(N)$ (where N is the length of the
processed sequence), low aliasing distortion for signal processing applications, and
adaptive time-frequency windows. Hence, the DWT has been studied and applied to a
wide range of applications including numerical analysis [5]-[6], biomedicine [7],
image and video processing [1], [8]-[9], signal processing techniques [10] and speech
compression/decompression [11]. DWT based compression methods have become the
basis of such international standards as JPEG 2000 and MPEG-4.



In many of these applications, real-time processing is required in order to achieve
useful results. Even though DWTs possess linear complexity, many applications
cannot be handled by software solutions only. DWT implementations using digital
signal processors (DSPs) improve computation speeds significantly, and are
sufficient for some applications. However, in many applications software DWT
implementations on general purpose processors or hardware implementations on DSPs
such as TMS320C6x are too slow. Therefore, the implementation of the DWT by
means of dedicated very large scale integrated (VLSI) Application Specific
Integrated Circuits (ASICs) has recently attracted the attention of a number of
researchers, and a number of DWT architectures have been proposed [12]-[24].
Some of these devices have been targeted to have a low hardware complexity.
However, they require at least 2N clock cycles (cc's) to compute the DWT of a
sequence having N samples. Nevertheless, devices have been designed having a
period of approximately N cc's (e.g., the three architectures in [14] when they are
provided with doubled hardware, the architecture A1 in [15], the architectures in
[16]-[18], the parallel filter in [19], etc.). Most of these architectures use the
Recursive Pyramid Algorithm (RPA) [26], or similar scheduling techniques, in order
both to reduce the memory requirement and to employ only one or two filter units,
independently of the number of decomposition levels (octaves) to be computed.
This is done by producing each output at the "earliest" instance that it can be
produced [26].

Architectures [17], [18] consist of only two pipeline stages where the first pipeline
stage implements the first DWT octave and the second stage implements all of the
following octaves based on the RPA. Even though the architectures of [17] and
[18] operate at approximately 100% hardware utilisation for a large enough number
of DWT octaves, they have complex control and/or memory requirements.
Furthermore, because they employ only two pipelining stages they have relatively
low speeds. The highest throughput achieved in conventional architectures is
$N = 2^m$ clock cycles for implementing a $2^m$-point DWT. Approximately 100%
hardware utilisation and higher throughput is achieved in previously proposed
architectures [31], [32].

The demand for low power VLSI circuits in modern mobile/visual communication
systems is increasing. Improvements in VLSI technology have considerably
reduced the cost of the hardware. Therefore, it is often worthwhile reducing the
period, even at the cost of increasing the amount of hardware. One reason is that
low-period devices consume less power. For instance, a device D having a period
$T = N/2$ cc's can be employed to perform processing which is twice as fast as a
device D' having a period $T = N$ cc's. Alternatively, if the device D is clocked at a
frequency $f$ then it can achieve the same performance as the device D' clocked at a
frequency $f' = 2f$. Therefore, for the device D the supply voltage (linear with
respect to $f$) and the power dissipation (linear with respect to $f^2$) can be reduced
by factors of 2 and 4 respectively, relative to the device D' [27].

High throughput architectures typically make use of pipelining or parallelism in
which the DWT octaves are implemented with a pipeline consisting of similar
hardware units (pipeline stages). Even though pipelining has already been
exploited by existing DWT architectures (e.g., those in [12], [23]-[24]), the fastest
pipelined designs need at least N time units to implement an N-point DWT.

Most of the known designs for implementation of DWTs are based on the
tree-structured filter bank representation of the DWT shown in Figure 1 where there
are several (J) stages (or octaves) of signal decomposition each followed by
downsampling by a factor of two. As a consequence of downsampling, the amount of
data input to each subsequent decomposition stage is half the amount input to the
immediately previous decomposition stage. This makes the hardware of the
decomposition stages in a typical pipelined device designed to implement the DWT
using the tree-structured approach heavily under-utilised, since the stage
implementing the octave $j = 1, \ldots, J$ is usually clocked at a frequency $2^{j-1}$
times lower than the clock frequency used in the first octave [24]. This
under-utilisation comes from a poor balancing of the pipeline stages when they
implement the DWT octaves and leads to a low efficiency.

In [30] a pipeline architecture has been proposed based on the tree-structured
filter bank representation which achieves approximately 100% hardware utilisation
and a throughput of $N/2 = 2^{m-1}$ clock cycles for a $2^m$-point DWT. This involves a
J-stage pipeline using, as far as possible, half as many processing units from
one stage to the next stage.
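
By way of illustration only, the balancing problem described above can be put in
numbers with a short Python sketch. It assumes a pipeline of J identical stages in
which the first stage is always busy and stage j is active only once in every
$2^{j-1}$ clock cycles; the function names are hypothetical and are not taken from any
of the cited architectures. The last helper shows the alternative of halving the
number of processing units per stage, in the simplified sense of never going below
one unit.

```python
def stage_utilisation(J):
    """Busy fraction of each of J identical stages when stage j (1-based)
    is clocked 2**(j-1) times slower than the first stage."""
    return [1.0 / 2 ** (j - 1) for j in range(1, J + 1)]


def average_utilisation(J):
    """Average busy fraction over all J stages of the unbalanced pipeline."""
    fractions = stage_utilisation(J)
    return sum(fractions) / J


def halved_units(first_stage_units, J):
    """Units per stage when each stage gets half as many units as the
    previous one, but never fewer than one (illustrative assumption)."""
    return [max(1, first_stage_units >> (j - 1)) for j in range(1, J + 1)]


if __name__ == "__main__":
    for J in (2, 3, 5):
        print(J, stage_utilisation(J), round(average_utilisation(J), 3))
    print(halved_units(4, 3))
```

Under these assumptions the average utilisation of a three-stage pipeline is below
60%, and it keeps falling as more octaves are added, which is the inefficiency the
passage above refers to.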

Known parallel or pipelined architectures essentially depend on DWT parameters
such as the input length, the number of octaves, the length and, in some cases, the
actual coefficient values of the low-pass and high-pass filters. For larger values of
these parameters, these architectures can be very large. Furthermore, it is only
possible to implement a DWT with fixed parameters within a given hardware
realization of a given architecture. However, in JPEG 2000, a DWT is separately
applied to tiles of an image, in which the sizes of the tiles may vary from one to
$2^{32} - 1$. The number of octaves of decomposition may vary from 0 to 255 for
different tiles. Thus it is desirable to have a device capable of implementing DWTs
with varying parameters or, in other words, a unified device that is relatively
independent of the DWT parameters. Designing such a device is straightforward in the
case of serial architectures. It is not so straightforward in the case of parallel or
pipelined architectures.

Most of the conventional architectures [12]-[26] employ a number of multipliers
and adders proportional to the length of the DWT filters. Even though some
architectures [17], [18] are capable of implementing DWTs with a varying number of
octaves, their efficiency decreases rapidly as the number of octaves increases.

SUMMARY OF THE INVENTION 

According to aspects of the invention, the present invention is directed to a
microprocessor structure for performing a discrete wavelet transform operation. In
one embodiment, the discrete wavelet transform operation comprises decomposition
of an input signal vector comprising a number of input samples, over a specified
number of decomposition levels j, where j is an integer in the range 1 to J,
starting from a first decomposition level and progressing to a final decomposition
level. The microprocessor structure has a number of processing stages, each of
the stages corresponding to a decomposition level j of the discrete wavelet
transform and being implemented by a number of basic processing elements. The
number of basic processing elements implemented in each of the processing
stages decreases by a constant factor at each increasing decomposition level j.

These are generally scalable structures which are based on a flowgraph
representation of DWTs.

In this invention, general parametric structures of two types of DWT architectures,
referred to as Type 1 and Type 2 core DWT architectures, are introduced, as well
as general parametric structures of two other DWT architectures which are
constructed based on either core DWT architecture and are referred to as the
multi-core DWT architecture and the variable resolution DWT architecture,
respectively. All the architectures can be implemented with a varying level of
parallelism, thus allowing a trade-off between speed and hardware complexity. In
this invention the advantages of both parallel and pipelined processing are combined
in order to develop DWT architectures with improved efficiency (hardware
utilisation) and, consequently, with improved throughput or power consumption.
General structures of several DWT architectures operating at approximately 100%
hardware utilisation at every level of parallelism are proposed. The architectures
presented are relatively independent of the size of the input signal, the length of
the DWT filters, and, in the case of the variable resolution DWT architecture, also
of the number of octaves. The architectures can be implemented with varying levels
of parallelism, providing an opportunity to determine the amount of hardware
resources required in a particular application and to trade off between speed, cost,
chip area and power consumption requirements. In addition, the architectures
demonstrate excellent area-time characteristics compared to the existing DWT
designs. The invention provides architectures which are regular and easily
controlled, and which do not contain feedback, long (depending on the length of the
input) connections or switches. They can be implemented as semisystolic arrays.



BRIEF DESCRIPTION OF THE DRAWINGS 



Embodiments of the invention will now be described, by way of example only, with 
reference to the accompanying drawings in which: 



Figure 1 shows the tree-structured definition/representation of DWTs on which
most of the known DWT architectures are based;
Figure 2 shows an example of a new flowgraph representation of DWTs on
which the architectures proposed in this invention are based;
Figure 3 shows an embodiment of a compact form flowgraph representation of
DWTs;
Figure 4 shows the general architecture of two types of core DWT architectures
according to the invention;
Figure 5 shows a possible embodiment of one stage of the Type 1 architecture
of Figure 4;
Figure 6 shows an embodiment of the Type 1 core DWT architecture of Figure 5
corresponding to the parameters: $p = L_{max} = 6$; $J = 3$; $N = 2^m$,
$m = 3, 4, \ldots$;
Figure 7 shows another embodiment of the Type 1 core DWT architecture of
Figure 5 corresponding to the parameters: $p = L_{max} = 6$; $J = 3$; $N = 2^m$,
$m = 3, 4, \ldots$;
Figure 8 shows four possible embodiments of PEs that can be used in the Type
1 core DWT architecture of Figure 4;
Figure 9 shows a possible embodiment of one stage of the Type 2 core DWT
architecture of Figure 4;
Figure 10 shows an embodiment of the possible realisation of the Type 2 core
DWT architecture;
Figure 11 shows the general architecture of the multi-core DWT architecture
according to the invention;
Figure 12 shows the general architecture of the variable resolution DWT
architecture according to the invention;
Figure 13 shows a plot of the delay versus number of BUs for known
architectures and DWT architectures according to the invention;
Figure 14 shows Table 1;
Figure 15 shows Table 2; and
Figure 16 shows Table 3.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 



To describe the architectures proposed in this invention we first need to define the
DWT and to present the basic algorithm that is implemented within the
architectures. There are several alternative definitions/representations of DWTs such
as the tree-structured filter bank, the lattice structure, the lifting scheme or the
matrix representation. The following discussion uses the matrix definition and a
flowgraph representation of DWTs which is very effective in designing efficient
parallel/pipelined DWT architectures.



A discrete wavelet transform is a linear transform $y = H \cdot x$, where
$x = [x_0, \ldots, x_{N-1}]^T$ and $y = [y_0, \ldots, y_{N-1}]^T$ are the input and the
output vectors of length $N = 2^m$, respectively, and $H$ is the DWT matrix of order
$N \times N$ which is formed as the product of sparse matrices:

$$H = H^{(J)} \cdot \ldots \cdot H^{(1)}, \quad 1 \le J \le m; \qquad
H^{(j)} = \begin{pmatrix} D_j & 0 \\ 0 & I_k \end{pmatrix}, \quad j = 1, \ldots, J, \qquad (1)$$

where $I_k$ is the identity $(k \times k)$ matrix ($k = 2^m - 2^{m-j+1}$), and $D_j$ is the
analysis $(2^{m-j+1} \times 2^{m-j+1})$ matrix at stage j having the following structure,
in which each successive pair of rows is shifted circularly by two columns:

$$D_j = P_j \cdot \begin{pmatrix}
l_1 & l_2 & \cdots & l_L & 0 & 0 & \cdots & 0 & 0 \\
h_1 & h_2 & \cdots & h_L & 0 & 0 & \cdots & 0 & 0 \\
0 & 0 & l_1 & l_2 & \cdots & l_L & \cdots & 0 & 0 \\
0 & 0 & h_1 & h_2 & \cdots & h_L & \cdots & 0 & 0 \\
\vdots & & & & & & & & \vdots \\
l_3 & l_4 & \cdots & l_L & 0 & \cdots & 0 & l_1 & l_2 \\
h_3 & h_4 & \cdots & h_L & 0 & \cdots & 0 & h_1 & h_2
\end{pmatrix} \qquad (2)$$

where $LP = [l_1, \ldots, l_L]$ and $HP = [h_1, \ldots, h_L]$ are the vectors of
coefficients of the low-pass and of the high-pass filters, respectively (L being the
length of the filters), and $P_j$ is the matrix of the perfect unshuffle operator (see
[31]) of the size $(2^{m-j+1} \times 2^{m-j+1})$. For the sake of clarity both filters are
assumed to have the same length, which is an even number. The result may be readily
expanded to the general case of arbitrary filter lengths. In the general case (that
is, where $N = r \times k^m$ rather than $N = 2^m$), where k is not equal to two (that is,
there are other than two filtering operations carried out in each PE and there are
other than two outputs from each PE), a suitable stride permutation rather than the
unshuffle operation is applied.



Adopting the representation (1)-(2), the DWT is computed in J stages (also called
decomposition levels or octaves), where the j-th stage, $j = 1, \ldots, J$, constitutes
multiplication of a sparse matrix $H^{(j)}$ by a current vector of scratch variables,
the first such vector being the input vector $x$. Noting that the lower right corner of
every matrix $H^{(j)}$ is an identity matrix and taking into account the structure of
the matrix $D_j$, the corresponding algorithm can be written as the following
pseudocode, where $x_{LP}^{(j)} = [x_{LP}^{(j)}(0), \ldots, x_{LP}^{(j)}(2^{m-j}-1)]^T$ and
$x_{HP}^{(j)} = [x_{HP}^{(j)}(0), \ldots, x_{HP}^{(j)}(2^{m-j}-1)]^T$, $j = 1, \ldots, J$, are
$(2^{m-j} \times 1)$ vectors of scratch variables, and the notation
$[(x_1)^T, \ldots, (x_k)^T]^T$ stands for concatenation of column vectors
$x_1, \ldots, x_k$.

Algorithm 1.

1. Set $x_{LP}^{(0)} = [x_{LP}^{(0)}(0), \ldots, x_{LP}^{(0)}(2^m - 1)]^T = x$;

2. For $j = 1, \ldots, J$ compute

   $x_{LP}^{(j)} = [x_{LP}^{(j)}(0), \ldots, x_{LP}^{(j)}(2^{m-j}-1)]^T$ and
   $x_{HP}^{(j)} = [x_{HP}^{(j)}(0), \ldots, x_{HP}^{(j)}(2^{m-j}-1)]^T$

   where

   $$[(x_{LP}^{(j)})^T, (x_{HP}^{(j)})^T]^T = D_j \cdot x_{LP}^{(j-1)} \qquad (3)$$

   or, equivalently,

2. For $i = 0, \ldots, 2^{m-j}-1$,
   Begin
      Form the vector /* a subvector of length L of the vector $x_{LP}^{(j-1)}$ */
      $\tilde{x} = [x_{LP}^{(j-1)}(2i), x_{LP}^{(j-1)}(2i+1), x_{LP}^{(j-1)}((2i+2) \bmod 2^{m-j+1}), \ldots, x_{LP}^{(j-1)}((2i+L-1) \bmod 2^{m-j+1})]^T$;
      Compute
      $x_{LP}^{(j)}(i) = LP \cdot \tilde{x}$; $x_{HP}^{(j)}(i) = HP \cdot \tilde{x}$;
   End

3. Form the output vector
   $y = [(x_{LP}^{(J)})^T, (x_{HP}^{(J)})^T, (x_{HP}^{(J-1)})^T, \ldots, (x_{HP}^{(2)})^T, (x_{HP}^{(1)})^T]^T$.
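
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal
illustration rather than a definitive implementation: the function name
`dwt_algorithm1` is hypothetical, the filters are passed as plain lists of equal
length, and the input length is assumed to be a multiple of $2^J$. The output
ordering (final low-pass band first, then the high-pass bands from octave J down
to octave 1) is an assumption consistent with the perfect unshuffle described in
the text.

```python
def dwt_algorithm1(x, lp, hp, J):
    """J-octave DWT with circular indexing, following Algorithm 1.

    x  : input samples (length divisible by 2**J)
    lp : low-pass coefficients  [l_1, ..., l_L]
    hp : high-pass coefficients [h_1, ..., h_L]
    """
    L = len(lp)
    assert len(hp) == L, "both filters are assumed to have the same length"
    assert len(x) % (2 ** J) == 0, "len(x) must be a multiple of 2**J"

    low = list(x)      # step 1: x_LP^(0) = x
    highs = []         # collects x_HP^(1), ..., x_HP^(J)

    for _ in range(J):                      # step 2: one pass per octave
        n = len(low)                        # 2**(m-j+1) inputs to this octave
        next_low, next_high = [], []
        for i in range(n // 2):
            # subvector of length L, taken circularly starting at index 2i
            window = [low[(2 * i + k) % n] for k in range(L)]
            next_low.append(sum(c * v for c, v in zip(lp, window)))
            next_high.append(sum(c * v for c, v in zip(hp, window)))
        low = next_low
        highs.append(next_high)

    # step 3: [x_LP^(J), x_HP^(J), x_HP^(J-1), ..., x_HP^(1)]
    y = list(low)
    for band in reversed(highs):
        y.extend(band)
    return y


if __name__ == "__main__":
    # Unnormalised two-tap averaging/differencing filters, used only as an example.
    print(dwt_algorithm1([1, 2, 3, 4, 5, 6, 7, 8], [0.5, 0.5], [0.5, -0.5], 3))
```

The inner loop over i has no dependence between iterations, which is what allows
the operations of one octave to be distributed over parallel processing elements.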



Computation of Algorithm 1 with the matrices $D_j$ of (2) can be demonstrated using
a flowgraph representation. An example for the case $N = 2^3 = 8$, $L = 4$, $J = 3$ is
shown in Figure 2. The flowgraph consists of J stages, the j-th stage,
$j = 1, \ldots, J$, having $2^{m-j}$ nodes (depicted as boxes on Figure 2). Each node
represents a basic DWT operation (see Figure 2(b)). The i-th node,
$i = 0, \ldots, 2^{m-j}-1$, of stage $j = 1, \ldots, J$ has incoming edges from L circularly
consecutive nodes $2i, 2i+1, (2i+2) \bmod 2^{m-j+1}, \ldots, (2i+L-1) \bmod 2^{m-j+1}$
of the preceding stage or (for the nodes of the first stage) from inputs. Every node
has two outgoing edges. An upper (lower) outgoing edge represents the value of the
inner product of the vector of low-pass (high-pass) filter coefficients with the
vector of the values of incoming edges. Outgoing values of a stage are permuted
according to the perfect unshuffle operator so that all the low-pass components (the
values of upper outgoing edges) are collected in the first half and the high-pass
components are collected in the second half of the permuted vector. Low-pass
components then form the input to the following stage or (for the nodes of the last
stage) represent output values. High-pass components and the low-pass components at
that stage represent output values at a given resolution.

Essentially, the flowgraph representation provides an alternative definition of dis- 
crete wavelet transforms. It has several advantages, at least from the implemen- 
tation point of view, as compared to the conventional DWT representations such 
as the tree-structured filter bank, lifting scheme or lattice structure representation. 

However, the flowgraph representation of DWTs as it has been presented has the
disadvantage of being very large for bigger values of N. This disadvantage can
be overcome based on the following. Assuming $J < \log_2 N$, i.e. the number of
decomposition levels is much smaller than the number of points in the input vector
(in most applications $J \ll \log_2 N$), one can see that the DWT flowgraph consists of
$N/2^J$ similar patterns (see the two hatching regions on Figure 2). Each pattern can
be considered as a $2^J$-point DWT with a specific strategy of forming the input
signals to each of its octaves. The $2^{m-j+1}$ input values of the j-th,
$j = 1, \ldots, J$, octave are divided within the original DWT (of length $N = 2^m$) into
$N/2^J = 2^{m-J}$ non-overlapping groups consisting of $2^{J-j+1}$ consecutive values.
This is equivalent to dividing the vector $x_{LP}^{(j-1)}$ of (3) into subvectors
$x_{LP}^{(j-1,s)} = x_{LP}^{(j-1)}(s \cdot 2^{J-j+1} : (s+1) \cdot 2^{J-j+1} - 1)$,
$s = 0, \ldots, 2^{m-J}-1$, where here and in the following the notation $x(a:b)$ stands
for the subvector of $x$ consisting of the a-th to b-th components of $x$. Then the
input of the j-th, $j = 1, \ldots, J$, octave within the s-th pattern is the subvector
$\hat{x}^{(j-1,s)}(0 : 2^{J-j+1} + L - 3)$ of the vector

$$\hat{x}^{(j-1,s)} = \left[ (x_{LP}^{(j-1,\, s \bmod 2^{m-J})})^T, (x_{LP}^{(j-1,\, (s+1) \bmod 2^{m-J})})^T, \ldots, (x_{LP}^{(j-1,\, (s+Q_j) \bmod 2^{m-J})})^T \right]^T \qquad (4)$$

being the concatenation of the vector $x_{LP}^{(j-1,s)}$ with the circularly next $Q_j$
vectors, where $Q_j = \lceil (L-2)/2^{J-j+1} \rceil$.

If the $2^{m-J}$ patterns are merged into a single pattern, a compact (or core)
flowgraph representation of the DWT is obtained. An example of a DWT compact
flowgraph representation for the case $J = 3$, $L = 4$ is shown in Figure 3. The compact
DWT flowgraph has $2^{J-j}$ nodes at its j-th stage, $j = 1, \ldots, J$, where a set of
$2^{m-J}$ temporally distributed values is now assigned to every node. Every node has
L incoming and two outgoing edges as in the ("non-compact") DWT flowgraph.
Again, incoming edges are from L "circularly consecutive" nodes of the previous
stage, but now every node represents a set of temporally distributed values.
Namely, the L inputs of the i-th node of the j-th stage, $j = 1, \ldots, J$, for its s-th
value, $s = 0, \ldots, 2^{m-J}-1$, are connected to the nodes $(2i+n) \bmod 2^{J-j+1}$,
$n = 0, \ldots, L-1$, of the (j-1)-st stage, which now represent their $(s+s')$-th values
where $s' = \lfloor (2i+n)/2^{J-j+1} \rfloor$. Also, outputs are now distributed over the
outgoing edges of the compact flowgraph not only spatially but also temporally. That
is, each outgoing edge corresponding to a high-pass filtering result of a node, or to
a low-pass filtering result of a node of the last stage, represents a set of $2^{m-J}$
output values. Note that the structure of the compact DWT flowgraph does not depend
on the length of the DWT but only on the number of decomposition levels and the
filter length. The DWT length is reflected only in the number of values represented
by every node. It should also be noted that the compact flowgraph has the structure
of a $2^J$-point DWT with a slightly modified appending strategy. In fact, this
appending strategy is often used in the matrix formulation of the DWT definition.
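
The connection rule of the compact flowgraph can be made concrete with a small
Python sketch. The helper name `compact_inputs` is hypothetical; it simply
evaluates the two expressions given above, $(2i+n) \bmod 2^{J-j+1}$ for the source
node and $\lfloor (2i+n)/2^{J-j+1} \rfloor$ for the temporal offset $s'$, for every
filter tap n.

```python
def compact_inputs(i, j, J, L):
    """Inputs of node i at stage j of the compact flowgraph.

    Returns one (source_node, time_offset) pair per filter tap n = 0..L-1,
    where source_node indexes the 2**(J-j+1) values feeding stage j and
    time_offset is the offset s' added to the current value index s.
    """
    width = 2 ** (J - j + 1)
    assert 0 <= i < width // 2, "stage j has 2**(J-j) nodes"
    return [((2 * i + n) % width, (2 * i + n) // width) for n in range(L)]


if __name__ == "__main__":
    # The case of Figure 3 as stated in the text: J = 3, L = 4.
    for j in (1, 2, 3):
        for i in range(2 ** (3 - j)):
            print(j, i, compact_inputs(i, j, 3, 4))
```

A non-zero time offset marks an edge that wraps around the pattern boundary and
therefore refers to a later member of the set of values held by the source node.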
Let $\hat{D}_j$ denote the main $(2^{J-j+1} \times (2^{J-j+1}+L-2))$-minor of $D_j$ (see
(2)), $j = 1, \ldots, J$, that is, let $\hat{D}_j$ be a matrix consisting of the first
$2^{J-j+1}$ rows and the first $2^{J-j+1}+L-2$ columns of $D_j$. For example, if
$J-j+1 = 2$ and $L = 6$, then $\hat{D}_j$ is of the form:

$$\hat{D}_j = \begin{pmatrix}
l_1 & l_2 & l_3 & l_4 & l_5 & l_6 & 0 & 0 \\
0 & 0 & l_1 & l_2 & l_3 & l_4 & l_5 & l_6 \\
h_1 & h_2 & h_3 & h_4 & h_5 & h_6 & 0 & 0 \\
0 & 0 & h_1 & h_2 & h_3 & h_4 & h_5 & h_6
\end{pmatrix}$$
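
A matrix of the form shown in the example can be generated for any filter length
and any value of $J-j+1$ with a few lines of Python. The sketch below is
illustrative only; the function name `analysis_minor` and its `levels` argument
(standing for $J-j+1$) are hypothetical, and the row layout (low-pass rows first,
then high-pass rows, each shifted by two columns per row) is read off the example
above.

```python
def analysis_minor(lp, hp, levels):
    """Build a (2**levels) x (2**levels + L - 2) matrix as a list of rows.

    The first half of the rows holds the low-pass coefficients and the second
    half the high-pass coefficients; row i of each half is shifted right by 2*i.
    """
    L = len(lp)
    assert len(hp) == L
    rows = 2 ** levels
    cols = rows + L - 2

    def shifted(coeffs, i):
        row = [0] * cols
        row[2 * i:2 * i + L] = coeffs   # place the L coefficients at column 2*i
        return row

    half = rows // 2
    return ([shifted(lp, i) for i in range(half)]
            + [shifted(hp, i) for i in range(half)])


if __name__ == "__main__":
    # Placeholder coefficient values chosen only to make the layout visible.
    for row in analysis_minor([1, 2, 3, 4, 5, 6], [11, 12, 13, 14, 15, 16], 2):
        print(row)
```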



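Adopting the notation of (4), the computational process represented by the
compact flowgraph can be described with the following pseudocode.

Algorithm 2.

1. For $s = 0, \ldots, 2^{m-J}-1$ set $x_{LP}^{(0,s)} = x(s \cdot 2^J : (s+1) \cdot 2^J - 1)$;

2. For $j = 1, \ldots, J$
   For $s = 0, \ldots, 2^{m-J}-1$
   Begin
      2.1. Set $\hat{x}^{(j-1,s)}$ according to (4)
      2.2. Compute $[(x_{LP}^{(j,s)})^T, (x_{HP}^{(j,s)})^T]^T = \hat{D}_j \cdot \hat{x}^{(j-1,s)}(0 : 2^{J-j+1}+L-3)$
   End

3. Form the output vector
   $y = [(x_{LP}^{(J,0)})^T, \ldots, (x_{LP}^{(J,2^{m-J}-1)})^T, (x_{HP}^{(J,0)})^T, \ldots, (x_{HP}^{(J,2^{m-J}-1)})^T, \ldots, (x_{HP}^{(1,0)})^T, \ldots, (x_{HP}^{(1,2^{m-J}-1)})^T]^T$.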
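
A minimal Python sketch of Algorithm 2 is given below for illustration. It
processes the input pattern by pattern, builds the extended vector of (4) by
appending the circularly next $Q_j$ subvectors, and applies the low-pass and
high-pass inner products directly instead of forming the matrix explicitly. The
function name `dwt_by_patterns` is hypothetical, and the ordering of the output
vector (final low-pass values, then high-pass bands from octave J down to 1, each
concatenated over the patterns) is an assumption made for the sketch.

```python
from math import ceil


def dwt_by_patterns(x, lp, hp, J):
    """Pattern-wise DWT following the structure of Algorithm 2."""
    L = len(lp)
    assert len(hp) == L
    S = len(x) >> J                       # number of patterns, 2**(m-J)
    assert S >= 1 and S * 2 ** J == len(x), "len(x) must be a multiple of 2**J"
    x = list(x)

    # Step 1: split the input into S subvectors of 2**J values each.
    low = [x[s * 2 ** J:(s + 1) * 2 ** J] for s in range(S)]
    highs = []                            # per octave: list of per-pattern HP vectors

    for j in range(1, J + 1):
        width = 2 ** (J - j + 1)          # inputs per pattern at octave j
        Q = ceil((L - 2) / width)         # number of following subvectors needed
        next_low, next_high = [], []
        for s in range(S):
            # Step 2.1: extended vector, wrapping circularly over the patterns.
            ext = [v for q in range(Q + 1) for v in low[(s + q) % S]]
            # Step 2.2: one low-pass and one high-pass value per node i.
            lo = [sum(c * v for c, v in zip(lp, ext[2 * i:2 * i + L]))
                  for i in range(width // 2)]
            hi = [sum(c * v for c, v in zip(hp, ext[2 * i:2 * i + L]))
                  for i in range(width // 2)]
            next_low.append(lo)
            next_high.append(hi)
        low = next_low
        highs.append(next_high)

    # Step 3: concatenate over patterns, coarsest band first.
    y = [v for part in low for v in part]
    for band in reversed(highs):
        y.extend(v for part in band for v in part)
    return y


if __name__ == "__main__":
    print(dwt_by_patterns(list(range(16)), [1, 2, 3, 4], [4, -3, 2, -1], 2))
```

Because the loop over s has no dependence between iterations within one octave,
it can be unrolled in space (parallel realisation), while the loop over j can be
unrolled in time (pipelined realisation).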



Implementing the cycle for s in parallel yields a parallel DWT realisation. On the
other hand, by exchanging the nesting order of the cycles for j and s and
implementing the (nested) cycle for j in parallel it is possible to implement a
pipelined DWT realisation. However, both of these realisations would be inefficient
since the number of operations is halved from one octave to the next. Combining
the two methods, by contrast, yields very efficient parallel-pipelined or partially
parallel-pipelined realisations.

To apply pipelining to Algorithm 2, retiming must be applied since computations
for $s$ include results of computations for $s+1, \ldots, s+Q_j-1$, meaning that the
j-th octave, $j = 1, \ldots, J$, introduces a delay of $Q_j$ steps. Since the delays are
accumulated, computations for the j-th octave, $j = 1, \ldots, J$, must start with a
delay of

$$s^*(j) = \sum_{n=1}^{j} Q_n$$

during the steps $s = s^*(j), \ldots, 2^{m-J} + s^*(j) - 1$. Thus, computations take place
starting from the step $s = s^*(1)$ until the step $s = s^*(J) + 2^{m-J} - 1$. At steps
$s = s^*(1), \ldots, s^*(2)-1$ computations of only the first octave are implemented, at
steps $s = s^*(2), \ldots, s^*(3)-1$ only operations of the first two octaves are
implemented, etc. Starting from step $s = s^*(J)$ until the step
$s = s^*(1) + 2^{m-J} - 1$ (provided $s^*(J) < s^*(1) + 2^{m-J}$) computations of all the
octaves $j = 1, \ldots, J$ are implemented, but starting from step
$s = s^*(1) + 2^{m-J}$ no computations for the first octave are implemented, starting
from step $s = s^*(2) + 2^{m-J}$ no computations for the first two octaves are
implemented, etc. In general, at step $s = s^*(1), \ldots, 2^{m-J} + s^*(J) - 1$
computations for octaves $j = J_1, \ldots, J_2$ are implemented, where
$J_1 = \min\{j \text{ such that } s^*(j) \le s < s^*(j) + 2^{m-J}\}$ and
$J_2 = \max\{j \text{ such that } s^*(j) \le s < s^*(j) + 2^{m-J}\}$. The following
pseudocode presents the pipelined DWT realisation which is implemented within the
architectures proposed in this invention.
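Algorithm 3.

1. For $s = 0, \ldots, 2^{m-J}-1$ set $x_{LP}^{(0,s)} = x(s \cdot 2^J : (s+1) \cdot 2^J - 1)$;

2. For $s = s^*(1), \ldots, 2^{m-J} + s^*(J) - 1$
   For $j = J_1, \ldots, J_2$ do in parallel
   Begin
      2.1. Set $\hat{x}^{(j-1,\, s-s^*(j))}$ according to (4)
      2.2. Compute
           $[(x_{LP}^{(j,\, s-s^*(j))})^T, (x_{HP}^{(j,\, s-s^*(j))})^T]^T = \hat{D}_j \cdot \hat{x}^{(j-1,\, s-s^*(j))}(0 : 2^{J-j+1}+L-3)$.
   End

3. Form the output vector
   $y = [(x_{LP}^{(J,0)})^T, \ldots, (x_{LP}^{(J,2^{m-J}-1)})^T, (x_{HP}^{(J,0)})^T, \ldots, (x_{HP}^{(J,2^{m-J}-1)})^T, \ldots, (x_{HP}^{(1,0)})^T, \ldots, (x_{HP}^{(1,2^{m-J}-1)})^T]^T$.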
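
The timing of the pipelined realisation can be explored with a short Python
sketch that lists, for every step s, the first and last octave that are active.
This is a sketch under stated assumptions: the start delay of octave j is taken
as the running sum $Q_1 + \ldots + Q_j$ with $Q_j = \lceil (L-2)/2^{J-j+1} \rceil$, on
the reading that the per-octave delays accumulate, and octave j is taken to be
active for $2^{m-J}$ consecutive steps from its start delay. The function names are
hypothetical.

```python
from math import ceil


def start_delays(J, L):
    """Start delay of each octave j = 1..J, assumed to be Q_1 + ... + Q_j."""
    delays, total = [], 0
    for j in range(1, J + 1):
        total += ceil((L - 2) / 2 ** (J - j + 1))
        delays.append(total)
    return delays


def schedule(m, J, L):
    """Map each step s to (first_active_octave, last_active_octave)."""
    steps = 2 ** (m - J)                  # operation steps per octave
    d = start_delays(J, L)
    plan = {}
    for s in range(d[0], steps + d[-1]):
        active = [j for j in range(1, J + 1)
                  if d[j - 1] <= s < d[j - 1] + steps]
        if active:                        # empty only if some Q_j exceeds 2**(m-J)
            plan[s] = (active[0], active[-1])
    return plan


if __name__ == "__main__":
    for s, octaves in schedule(5, 3, 4).items():
        print(s, octaves)
```

In the printed schedule the pipeline fills during the first steps, runs with all
octaves active in the middle, and drains at the end, which mirrors the description
of the start-up and wind-down phases given before Algorithm 3.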



In this invention, general parametric structures of two types of DWT architectures,
referred to as Type 1 and Type 2 core DWT architectures, are introduced, as well
as general parametric structures of two other DWT architectures which are
constructed based on either core DWT architecture and are referred to as the
multi-core DWT architecture and the variable resolution DWT architecture,
respectively. All the architectures can be implemented with a varying level of
parallelism, thus allowing a trade-off between speed and hardware complexity.
Depending on the level of parallelism, throughputs up to constant time
implementation (one $2^m$-point DWT per time unit) may be achieved. At every level of
parallelism the architectures operate with approximately 100% hardware utilisation,
thus achieving almost linear speed-up with respect to the level of parallelism
compared with the serial DWT implementation. The architectures are relatively
independent of the DWT parameters. That is, a device having one of the proposed
architectures would be capable of efficiently implementing not only one DWT but a
range of different DWTs with different filter lengths and, in the case of the
variable resolution DWT architecture, with a different number of decomposition
levels over vectors of arbitrary length.

Many different realisations of the proposed architectures are possible. Therefore,
their general structures are described at a functional level. Realisations of the
general structures at the register level are also presented. These realisations
demonstrate the validity of the general structures. The proposed architectures are
essentially different in terms of functional description level from known
architectures.

The Type 1 and Type 2 core DWT architectures implement a $2^m$-point DWT with
$J \le m$ octaves based on low-pass and high-pass filters of a length L not
exceeding a given number $L_{max}$, where J and $L_{max}$ (but not m or L) are
parameters of the realisation. Both Type 1 and Type 2 core DWT architectures comprise
a serial or parallel data input block and J pipeline stages, the j-th stage,
$j = 1, \ldots, J$, consisting of a data routing block and $2^{J-j}$ processing elements
(PEs). This will be described in relation to Figure 4. The j-th pipeline stage,
$j = 1, \ldots, J$, of the architecture implements the $2^{m-j}$ independent similar
operations of the j-th DWT octave in $2^{m-J}$ operation steps. At every step, a group
of $2^{J-j}$ operations is implemented in parallel within the $2^{J-j}$ PEs of the
pipeline stage. The PEs can be implemented with a varying level of parallelism, which
is specified by the number $p \le L_{max}$ of their inputs. A single PE with p inputs
implements one basic DWT operation (see Figure 2(b)) in one operation step consisting
of $\lceil L/p \rceil$ time units, where at every time unit the results of 2p
multiplications and additions are obtained in parallel. Thus the time period
(measured as the interval between time units when successive input vectors enter
the architecture) is equal to $2^{m-J} \lceil L/p \rceil$ time units, where the duration
of the time unit is equal to the period of one multiplication operation. This is
$2^J / \lceil L/p \rceil$ times faster than the best period of previously known
architectures [12]-[26] and $2^{J-1} / \lceil L/p \rceil$ times faster than the
architectures described in [30]. The efficiency (or hardware utilisation) of both
architectures is equal to $L/(p \lceil L/p \rceil) \cdot 100\% \approx 100\%$. In the case
of $p = L = L_{max}$ the period is $2^{m-J}$ time units, which is the same as for the LPP
architecture which, however, depends on the filter length L (i.e. the LPP
architecture is only able to implement DWTs with filters of a fixed length L). The
two types of core DWT architecture differ according to the absence (Type 1) or
presence (Type 2) of interconnection between the PEs of one pipeline stage. Possible
realisations of the two types of core DWT architectures are presented in Figures 5
to 10. The two types of core DWT architecture described above may be implemented
with a varying degree of parallelism depending on the parameter p.
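
The period and efficiency expressions quoted above are easy to tabulate. The
following Python sketch is an illustration only, with hypothetical function names;
it evaluates the period $2^{m-J} \lceil L/p \rceil$, the efficiency
$L/(p \lceil L/p \rceil)$ and the speed-up $2^J / \lceil L/p \rceil$ relative to an
architecture with a period of $N = 2^m$ time units.

```python
from math import ceil


def core_period(m, J, L, p):
    """Time units between successive input vectors: 2**(m-J) * ceil(L/p)."""
    return 2 ** (m - J) * ceil(L / p)


def core_efficiency(L, p):
    """Hardware utilisation as a fraction: L / (p * ceil(L/p))."""
    return L / (p * ceil(L / p))


def speedup_over_period_n(J, L, p):
    """Speed-up relative to a period of N = 2**m time units."""
    return 2 ** J / ceil(L / p)


if __name__ == "__main__":
    m, J, L = 10, 3, 6
    for p in (1, 2, 3, 4, 6):
        print(p, core_period(m, J, L, p),
              round(core_efficiency(L, p), 3),
              round(speedup_over_period_n(J, L, p), 3))
```

The table produced by the example shows that the efficiency is exactly 100% when
p divides L and drops somewhat otherwise (for instance with L = 6 and p = 4), which
is why the text speaks of approximately 100% utilisation.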

Further flexibility in the level of parallelism is achieved within multi-core DWT
architectures by introducing a new parameter $r = 1, \ldots, 2^{m-J}$. The multi-core
DWT architecture is, in fact, obtained from the corresponding (single-)core DWT
architecture by expanding it r times. Its general structure is presented in
Figure 11. The architecture consists of a serial or parallel data input block and J
pipeline stages, the j-th pipeline stage, $j = 1, \ldots, J$, consisting of a data
routing block and $r \cdot 2^{J-j}$ PEs. The time period of the multi-core DWT
architecture is equal to $(2^{m-J} \lceil L/p \rceil)/r$ time units, which is r times
faster than that of the single-core DWT architecture, i.e. a linear speed-up is
provided. The efficiency of the multi-core DWT architecture is the same as for
single-core architectures, that is, approximately 100%. Note that in the case of
$p = L = L_{max}$ and $r = 2^{m-J}$ the period is just one time unit for a $2^m$-point
DWT. Similar performance is achieved in the FPP architecture, which can be
considered as a special case ($p = L = L_{max}$ and $r = 2^{m-J}$) of a possible
realisation of the multi-core DWT architecture.
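
The effect of the parameter r on hardware size and period can be illustrated with
the following Python sketch. The function names are hypothetical; the formulas are
the ones stated above, namely $r \cdot 2^{J-j}$ PEs in stage j and a period of
$(2^{m-J} \lceil L/p \rceil)/r$ time units.

```python
from math import ceil


def multicore_pes_per_stage(J, r):
    """Number of PEs in each pipeline stage j = 1..J: r * 2**(J-j)."""
    return [r * 2 ** (J - j) for j in range(1, J + 1)]


def multicore_period(m, J, L, p, r):
    """Period in time units: (2**(m-J) * ceil(L/p)) / r."""
    assert 1 <= r <= 2 ** (m - J), "r ranges over 1..2**(m-J)"
    return 2 ** (m - J) * ceil(L / p) / r


if __name__ == "__main__":
    m, J, L, p = 5, 3, 6, 6
    for r in (1, 2, 4):
        print(r, multicore_pes_per_stage(J, r), multicore_period(m, J, L, p, r))
```

With r = 1 the figures reduce to those of the single-core architecture, and with
$p = L$ and $r = 2^{m-J}$ the period becomes one time unit, matching the special case
noted in the text.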

The (single- and multi-)core DWT architectures are relatively independent of the
length of the input and of the length of the filters, which means that DWTs based
on arbitrary filters (having a length not exceeding $L_{max}$) over signals of arbitrary
length can be efficiently implemented with the same device having either the Type 1
or the Type 2 core DWT architecture. However, these architectures are dependent on
the number of DWT octaves J. They may implement DWTs with fewer than J
octaves, though with some loss in hardware utilisation.

The variable resolution DWT architecture implements DWTs with an arbitrary number
of octaves $J'$, and the efficiency of the architecture remains approximately 100%
whenever $J'$ is larger than or equal to a given number $J_{min}$. The variable
resolution DWT architecture comprises a core DWT architecture corresponding to
$J_{min}$ decomposition levels and an arbitrary serial DWT architecture, for instance
an RPA-based architecture (see Figure 12(a)). The core DWT architecture implements
the first $J_{min}$ octaves of the $J'$-octave DWT and the serial DWT architecture
implements the last $J' - J_{min}$ octaves of the $J'$-octave DWT. Since the core DWT
architecture may be implemented with a varying level of parallelism it can be
balanced with the serial DWT architecture in such a way that approximately 100%
hardware utilisation is achieved whenever $J' \ge J_{min}$.

A variable resolution DWT architecture based on a multi-core DWT architecture 
may also be constructed (see Figure 12,(b)) in which a data routing block is in- 
serted between the multi-core and serial DWT architectures. 

The proposed DWT Architectures 

This section presents the general structures of two types of DWT architecture,
referred to as Type 1 and Type 2 core DWT architectures, as well as two other DWT
architectures which are constructed based on either core DWT architecture and
are referred to as the multi-core DWT architecture and the variable resolution
DWT architecture, respectively. The multi-core DWT architecture is an extension
of either one of the core DWT architectures, which can be implemented with a
varying level of parallelism depending on a parameter $m$, and in a particular case
($m = 1$) it becomes the (single-)core DWT architecture. For ease of understanding,
the presentation of the architectures starts with a description of the (single-)core
DWT architectures.
Both types of core DWT architecture implement an arbitrary discrete wavelet
transform with $J$ decomposition levels (octaves) based on low-pass and high-pass
filters having a length $L$ not exceeding a given number $L_{max}$. Their operation
is based on Algorithm 3 presented earlier. The general structure representing both
types of core DWT architecture is presented in Figure 4, where dashed lines depict
connections which may or may not be present depending on the specific realisation.
These connections are not present in Type 1 but are present in Type 2. In both
cases the architecture consists of a data input block and $J$ pipeline stages, each
stage containing a data routing block and a block of processor elements (PEs),
wherein the data input block implements Step 1 of Algorithm 3, the data routing
blocks are responsible for Step 2.1, and the blocks of PEs carry out the computations
of Step 2.2. The two core architecture types mainly differ in the possibility of
data exchange between PEs of the same pipeline stage. In the Type 2 core DWT
architecture, PEs of a single stage may exchange intermediate data via
interconnections, while in the Type 1 core DWT architecture there are no
interconnections between the PEs within a pipeline stage and thus the PEs of a single
stage do not exchange data during their operation.

In general many different realisations of data routing blocks and blocks of PEs are
possible. Therefore, in one aspect, the invention can be seen to be the architectures
as they are depicted at the block level (Figures 4, 11, and 12) and as they
are described below at the functional level, independent of the precise implementation
chosen for the PEs and data routing blocks. However, some practical realisations
of the proposed core DWT architectures at register level are presented by
way of example with reference to Figures 5 to 10. These exemplary implementations
demonstrate the validity of the invention.

Figure 4 presents the general structure of the Type 1 and Type 2 core DWT
architecture. As explained in the foregoing, Type 1 and Type 2 differ only in the lack
or presence of interconnection between the PEs within a stage. The data input
block of both core DWT architectures may be realized as either word-serial or
word-parallel. In the former case the data input block consists of a single
(word-serial) input port which is connected to a shift register of length $2^J$ (dashed
box in Figure 4) having a word-parallel output from each of its cells. In the latter
case the data input block comprises $2^J$ parallel input ports. In both cases the data
input block has $2^J$ parallel outputs which are connected to the $2^J$ inputs of the
data routing block of the first pipeline stage. Figure 6 presents an example of a
word-parallel data input block, while Figures 7 and 10 present examples of
a word-serial data input block.

Type 1 core DWT architecture

The basic algorithm implemented within the Type 1 core DWT architecture is
Algorithm 3 with a specific order of implementing Step 2.2. The structure of the
matrix $\hat{D}_j$ (having $2^{J-j+1}$ rows) is such that the matrix-vector multiplication
of Step 2.2 can be decomposed into $2^{J-j}$ pairs of vector-vector inner product
computations:

$$x^{(j,\,s-s^*(j))}(i) = LP \cdot \hat{x}^{(j-1,\,s-s^*(j))}(2i : 2i+L-1),$$
$$x^{(j,\,s-s^*(j))}(i + 2^{J-j}) = HP \cdot \hat{x}^{(j-1,\,s-s^*(j))}(2i : 2i+L-1),$$
$$i = 0, \ldots, 2^{J-j}-1,$$

which can be implemented in parallel. On the other hand, every vector-vector inner
product of length $L$ can be decomposed into a sequence of $L_p = \lceil L/p \rceil$ inner
products of length $p$ with accumulation of the results (assuming that the coefficient
vectors and input vectors are appended with an appropriate number of zeros
and are divided into subvectors of $p$ consecutive components). As a result,
Algorithm 3 can be presented with the following modification of the previous
pseudocode.






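Algorithm 3.1.

1. For $s = 0, \ldots, 2^{m-J}-1$ set $x_{LP}^{(0,s)} = x(s \cdot 2^J : (s+1) \cdot 2^J - 1)$;

2. For $s = s^*(1), \ldots, 2^{m-J} + s^*(J) - 1$
   For $j = j_1, \ldots, j_2$ do in parallel
   Begin

   2.1. Set $\hat{x}^{(j-1,\,s-s^*(j))}$ according to (4)

   2.2. For $i = 0, \ldots, 2^{J-j}-1$ do in parallel

        Begin
          Set $S_{LP}(i) = 0$, $S_{HP}(i) = 0$;
          For $n = 0, \ldots, L_p - 1$ do in sequential
          Begin

            $S_{LP}(i) = S_{LP}(i) + \sum_{k=0}^{p-1} l_{np+k}\, \hat{x}^{(j-1,\,s-s^*(j))}(2i + np + k)$;   (6)
            $S_{HP}(i) = S_{HP}(i) + \sum_{k=0}^{p-1} h_{np+k}\, \hat{x}^{(j-1,\,s-s^*(j))}(2i + np + k)$;   (7)

          End

          Set $x_{LP}^{(j,\,s-s^*(j))}(i) = S_{LP}(i)$; $x_{HP}^{(j,\,s-s^*(j))}(i) = S_{HP}(i)$
        End

   End

3. Form the output vector $y$ from the vectors $x_{LP}^{(J,s)}$ and $x_{HP}^{(j,s)}$,
   $j = 1, \ldots, J$, $s = 0, \ldots, 2^{m-J}-1$.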
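The accumulation performed in operations (6) and (7) can be illustrated with a minimal Python sketch of Step 2.2 for a single stage and a single step. The function names `stage_step` and `direct_stage`, the argument layout, and the test data are illustrative assumptions and are not part of the described architecture. The sketch assumes the zero-padding described above and checks that splitting each length-$L$ inner product into $L_p$ partial products of length $p$ gives the same result as the undivided inner products.

```python
def stage_step(x_hat, lp, hp, n_pe, p):
    """Step 2.2 of Algorithm 3.1 for one stage: n_pe PEs, p products per time unit."""
    L = len(lp)
    Lp = -(-L // p)                      # L_p = ceil(L / p) time units per step
    pad = Lp * p - L
    l = list(lp) + [0] * pad             # coefficient vectors appended with zeros
    h = list(hp) + [0] * pad
    needed = 2 * (n_pe - 1) + Lp * p     # highest index read, plus one
    x = list(x_hat) + [0] * max(0, needed - len(x_hat))
    s_lp = [0] * n_pe
    s_hp = [0] * n_pe
    for n in range(Lp):                  # sequential time units
        for i in range(n_pe):            # done in parallel by the PEs in hardware
            for k in range(p):
                v = x[2 * i + n * p + k]
                s_lp[i] += l[n * p + k] * v   # operation (6)
                s_hp[i] += h[n * p + k] * v   # operation (7)
    return s_lp, s_hp


def direct_stage(x_hat, lp, hp, n_pe):
    """Reference: undivided inner products LP . x(2i:2i+L-1) and HP . x(2i:2i+L-1)."""
    L = len(lp)
    low = [sum(lp[k] * x_hat[2 * i + k] for k in range(L)) for i in range(n_pe)]
    high = [sum(hp[k] * x_hat[2 * i + k] for k in range(L)) for i in range(n_pe)]
    return low, high
```

With a 6-tap filter pair and four PEs, the input vector needs $2 \cdot 4 + 6 - 2 = 12$ components; the result is the same for every choice of `p`, only the number of time units per step changes.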
Note that given $s$ and $j$, the group of operations (6) and (7) involves the subvector
$\hat{x}^{(j-1,\,s-s^*(j))}(0 : 2^{J-j+1} + p - 3)$ for $n = 0$, the subvector
$\hat{x}^{(j-1,\,s-s^*(j))}(p : 2^{J-j+1} + 2p - 3)$ for $n = 1$ and, in general, the subvector
$\hat{x}^{(j-1,\,s-s^*(j),\,n)} = \hat{x}^{(j-1,\,s-s^*(j))}(np : 2^{J-j+1} + (n+1)p - 3)$ for
$n = 0, \ldots, L_p - 1$. In other words, computations for $n = 0, \ldots, L_p - 1$ involve the
first $2^{J-j+1} + p - 2$ components of the vector $\hat{x}^{(j-1,\,s-s^*(j),\,n)}$, which is obtained
from the vector $\hat{x}^{(j-1,\,s-s^*(j))}$ by shifting its components left by $np$ positions. It
should also be noted that computations for a given $i = 0, \ldots, 2^{J-j} - 1$ always involve
the components $2i, 2i+1, \ldots, 2i+p-1$ of the current vector $\hat{x}^{(j-1,\,s-s^*(j),\,n)}$.

The general structure of the Type 1 core DWT architecture is presented in Figure
4. In the case of this architecture, the dashed lines can be disregarded because
there are no connections between PEs of a single stage. The architecture consists
of a data input block (already described above) and $J$ pipeline stages. In general,
the $j$th pipeline stage, $j = 1, \ldots, J$, of the Type 1 core DWT architecture comprises a
data routing block having $2^{J-j+1}$ inputs $I_{PS(j)}(0), \ldots, I_{PS(j)}(2^{J-j+1} - 1)$ forming the
input to the stage, and $2^{J-j+1} + p - 2$ outputs $O_{DRB(j)}(0), \ldots, O_{DRB(j)}(2^{J-j+1} + p - 3)$
connected to the inputs of $2^{J-j}$ PEs. Every PE has $p$ inputs and two outputs, where
$p \leq L_{max}$ is a parameter of the realisation describing the level of parallelism of every
PE. Consecutive $p$ outputs $O_{DRB(j)}(2i), O_{DRB(j)}(2i+1), \ldots, O_{DRB(j)}(2i+p-1)$ of the data
routing block of the $j$th, $j = 1, \ldots, J$, stage are connected to the $p$ inputs of the $i$th,
$i = 0, \ldots, 2^{J-j} - 1$, PE ($PE_{j,i}$) of the same stage. The first outputs of each of the $2^{J-j}$
PEs of the $j$th pipeline stage, $j = 1, \ldots, J-1$, form the outputs
$O_{PS(j)}(0), \ldots, O_{PS(j)}(2^{J-j} - 1)$ of that stage and are connected to the $2^{J-j}$ inputs
$I_{PS(j+1)}(0), \ldots, I_{PS(j+1)}(2^{J-j} - 1)$ of the data routing block of the next, $(j+1)$st, stage.
The first output of the (one) PE of the last, $J$th, stage is the 0th output $out(0)$ of the
architecture. The second outputs of the $2^{J-j}$ PEs of the $j$th pipeline stage,
$j = 1, \ldots, J$, form the $(2^{J-j})$th to $(2^{J-j+1} - 1)$st outputs
$out(2^{J-j}), \ldots, out(2^{J-j+1} - 1)$ of the architecture.
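The port counts and connections described above can be tabulated with a short Python sketch. The function name `type1_stage_wiring` and the dictionary keys are hypothetical labels chosen for illustration; only the index arithmetic is taken from the description.

```python
def type1_stage_wiring(J, j, p):
    """Port counts and PE connections of pipeline stage j (1 <= j <= J), Type 1 core."""
    n_in = 2 ** (J - j + 1)          # inputs of the data routing block
    n_pe = 2 ** (J - j)              # PEs in the stage
    n_out = n_in + p - 2             # outputs of the data routing block
    # PE i reads the p consecutive routing-block outputs 2i, ..., 2i + p - 1
    pe_inputs = {i: list(range(2 * i, 2 * i + p)) for i in range(n_pe)}
    # second (high-pass) PE outputs form out(2^(J-j)), ..., out(2^(J-j+1) - 1)
    highpass_outputs = list(range(n_pe, 2 * n_pe))
    return {
        "stage_inputs": n_in,
        "drb_outputs": n_out,
        "pes": n_pe,
        "pe_inputs": pe_inputs,
        "highpass_outputs": highpass_outputs,
    }
```

For $J = 3$ the three stages contain 4, 2 and 1 PEs, and the last PE of every stage reads up to the last output of its data routing block, so no routing-block output is left unused.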

The blocks of the Type 1 core DWT architecture are now described at the functional
level. For convenience, a time unit is defined as the period for the PEs to
complete one operation (which is equal to the period between successive groups
of $p$ data entering the PE) and an operation step of the architecture is defined as
comprising $L_p$ time units.
The functionality of the data input block is clear from its structure. It accepts,
serially or in parallel, and outputs in parallel a group of components of the input
vector at the rate of $2^J$ components per operation step. Thus, the vector $x_{LP}^{(0,s)}$ is
formed on the outputs of the data input block at the step $s = 0, \ldots, 2^{m-J} - 1$.

The data routing block (of the stage $j = 1, \ldots, J$) can, in general, be realized as an
arbitrary circuitry which at the first time unit $n = 0$ of its every operation step accepts
in parallel a vector of $2^{J-j+1}$ components, and then at every time unit
$n = 0, \ldots, L_p - 1$ of that operation step outputs in parallel a vector of $2^{J-j+1} + p - 2$
components $np, np+1, \ldots, (n+1)p + 2^{J-j+1} - 3$ of a vector being the concatenation (in
chronological order) of the vectors accepted at previous $Q_j$ steps, where

$$Q_j = \lceil (L_{max} - 2) / 2^{J-j+1} \rceil, \quad j = 1, \ldots, J. \qquad (8)$$
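As a small worked example of (8), the following Python sketch evaluates the number of earlier steps each stage has to remember. The function name `routing_depths` is an illustrative assumption; the formula is read as a ceiling of $(L_{max} - 2)/2^{J-j+1}$.

```python
def routing_depths(L_max, J):
    """Q_j = ceil((L_max - 2) / 2^(J-j+1)) for j = 1, ..., J (integer arithmetic)."""
    return [-(-(L_max - 2) // 2 ** (J - j + 1)) for j in range(1, J + 1)]
```

For the case $L_{max} = 6$, $J = 3$ used in the figures, the stage input widths are 8, 4 and 2, so the depths grow towards the later, narrower stages.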

The functionality of the PEs used in the Type 1 core DWT architecture is to
compute two inner products of the vector on its $p$ inputs with two vectors of
predetermined coefficients during every time unit and to accumulate the results of both
inner products computed during one operation step. At the end of every operation
step, the two accumulated results pass to the two outputs of the PE and a new
accumulation starts. Clearly, every PE implements the pair of operations (6) and (7)
provided that the correct arguments are formed on its inputs.

It will now be demonstrated that the architecture implements computations
according to Algorithm 3.1. In the case where $L < L_{max}$ an extra delay is introduced.
The extra delay is a consequence of the flexibility of the architecture that enables
it to implement DWTs with arbitrary filter length $L \leq L_{max}$. This should be compared
with Algorithm 3.1, which presents computation of a DWT with a fixed filter length
$L$. In fact, the architecture is designed for the filter length $L_{max}$ but also implements
DWTs with shorter filters with a slightly increased time delay but without any loss
in time period.
Denote

$$\hat{s}(0) = 0, \quad \hat{s}(j) = \sum_{n=1}^{j} Q_n, \quad j = 1, \ldots, J. \qquad (9)$$

During operation of the architecture, the vector $x_{LP}^{(0,s)}$ is formed on the outputs of
the data input block at step $s = 0, \ldots, 2^{m-J} - 1$ and this enters the inputs of the data
routing block of the first pipeline stage. To show that the architecture implements
computations according to Algorithm 3.1 it is sufficient to show that the vectors
$x_{LP}^{(j,\,s-\hat{s}(j))}$ are formed at the first outputs of PEs of the $j$th stage (which are
connected to the inputs of the $(j+1)$st stage) and the vectors $x_{HP}^{(j,\,s-\hat{s}(j))}$ are formed at
their second outputs at steps $s = \hat{s}(j), \ldots, \hat{s}(j) + 2^{m-J} - 1$ provided that the vectors
$x_{LP}^{(j-1,\,s-\hat{s}(j-1))}$ enter the $j$th stage at steps $s = \hat{s}(j-1), \ldots, \hat{s}(j-1) + 2^{m-J} - 1$ (proof by
mathematical induction). Thus, it is assumed that the data routing block of stage
$j = 1, \ldots, J$ accepts vectors $x_{LP}^{(j-1,\,s-\hat{s}(j-1))}$ at steps $s = \hat{s}(j-1), \ldots, \hat{s}(j-1) + 2^{m-J} - 1$. Then,
according to the functional description of the data routing blocks, the components
$np, np+1, \ldots, (n+1)p + 2^{J-j+1} - 3$ of the vector

$$\tilde{x}^{(j-1,\,s-\hat{s}(j))} = \left[ \left(x_{LP}^{(j-1,\,s-\hat{s}(j))}\right)^T, \left(x_{LP}^{(j-1,\,s-\hat{s}(j)+1)}\right)^T, \ldots, \left(x_{LP}^{(j-1,\,s-\hat{s}(j)+Q_j)}\right)^T \right]^T \qquad (10)$$

being the concatenation of the vectors accepted at steps $s - Q_j, s - Q_j + 1, \ldots, s$,
respectively, will be formed on the outputs of the data routing block at the time unit
$n = 0, \ldots, L_p - 1$ of every step $s = \hat{s}(j), \ldots, \hat{s}(j) + 2^{m-J} - 1$. Since $\hat{s}(j) \geq s^*(j)$ (compare (3) and
(9)), the vector $\hat{x}^{(j-1,\,s-\hat{s}(j))}$ (defined according to (4)) is the subvector of
$\tilde{x}^{(j-1,\,s-\hat{s}(j))}$ so that their first $2^{J-j+1} + L - 3$ components are exactly the same. Thus the
vector $\hat{x}^{(j-1,\,s-\hat{s}(j),\,n)} = \hat{x}^{(j-1,\,s-\hat{s}(j))}(np : 2^{J-j+1} + (n+1)p - 3)$ is formed at the time unit
$n = 0, \ldots, L_p - 1$ of the step $s = \hat{s}(j), \ldots, \hat{s}(j) + 2^{m-J} - 1$ at the outputs of the data routing
block of stage $j = 1, \ldots, J$. Due to the connections between the data routing block and
PEs, the components $2i, 2i+1, \ldots, 2i+p-1$ of the vector $\hat{x}^{(j-1,\,s-\hat{s}(j),\,n)}$, which are, in
fact, arguments of the operations (6) and (7), will be formed on the inputs of
$PE_{j,i}$, $i = 0, \ldots, 2^{J-j} - 1$, at the time unit $n = 0, \ldots, L_p - 1$ of the step
$s = \hat{s}(j), \ldots, \hat{s}(j) + 2^{m-J} - 1$. Thus, if PEs implement their operations with corresponding
coefficients, the vector $x_{LP}^{(j,\,s-\hat{s}(j))}$ will be formed on the first outputs of the PEs and the
vector $x_{HP}^{(j,\,s-\hat{s}(j))}$ will be formed on their second outputs after the step
$s = \hat{s}(j), \ldots, \hat{s}(j) + 2^{m-J} - 1$. Since the first outputs of PEs are connected to the inputs of
the next pipeline stage, this proves that the architecture implements computations
according to Algorithm 3.1, albeit with different timing (replace $s^*(j)$ with $\hat{s}(j)$
everywhere in Algorithm 3.1).

From the above considerations it is clear that a $2^m$-point DWT is implemented with
the Type 1 core DWT architecture in $2^{m-J} + \hat{s}(J)$ steps, each consisting of $L_p$ time
units. Thus the delay between input and corresponding output vectors is equal to

$$T_d(C1) = \left(2^{m-J} + \hat{s}(J)\right) \lceil L/p \rceil \qquad (11)$$

time units. Clearly the architecture can implement DWTs of a stream of input vectors.
It is therefore apparent that the throughput or the time period (measured as
the interval between time units when successive input vectors enter the
architecture) is equal to

$$T_p(C1) = 2^{m-J} \lceil L/p \rceil \qquad (12)$$

time units.
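The delay and time period expressions (11) and (12) can be evaluated numerically with the following Python sketch. The function name `type1_timing` is an illustrative assumption, and the accumulated routing delay of the last stage (the quantity defined in (9)) is passed in as the parameter `s_hat_J` rather than recomputed.

```python
def type1_timing(m, J, L, p, s_hat_J):
    """Delay (11) and time period (12) of the Type 1 core architecture, in time units."""
    Lp = -(-L // p)                          # L_p = ceil(L / p) time units per step
    steps = 2 ** (m - J)                     # input vectors of length 2^J per DWT
    delay = (steps + s_hat_J) * Lp           # formula (11)
    period = steps * Lp                      # formula (12)
    return delay, period
```

For a 1024-point, 3-octave DWT with a 6-tap filter, two inputs per PE and an assumed routing delay of 4 steps, each step takes 3 time units; the delay exceeds the period only by the pipeline fill time.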



Performance of parallel/pipelined architectures is often evaluated with respect to
hardware utilization or efficiency, defined as

$$E = \frac{T(1)}{K \cdot T(K)} \cdot 100\% \qquad (13)$$

where $T(1)$ is the time of implementation of an algorithm with one PE and $T(K)$ is
the time of implementation of the same algorithm with an architecture comprising
$K$ PEs. It can be seen that $T(1) = (2^m - 1) \lceil L/p \rceil$ time units are required to implement a
$2^m$-point DWT using one PE similar to the PEs used in the Type 1 core DWT
architecture. Together with (11) and (12), and taking into account the fact that there
are in total $K = 2^J - 1$ PEs within the Type 1 core DWT architecture, it can be shown
that approximately 100% efficiency (hardware utilisation) is achieved for the
architecture with respect to both time delay and, moreover, time period complexities. It
should be noted that an efficiency close to the efficiency of the FPP architecture is
reached only in a few pipelined DWT designs (see [17]) known from the prior art,
whereas most of the known pipelined DWT architectures reach much less than
100% average efficiency. It should also be noted that a time period of at least
$O(N)$ time units is required by known DWT architectures. The proposed architecture
may be realized with a varying level of parallelism depending on the parameter
$p$. As follows from (12), the time period complexity of the implementation varies
between $T_p(C1) = 2^{m-J}$ and $T_p(C1) = L\,2^{m-J}$. Thus the throughput of the architecture is
$2^J/L$ to $2^J$ times faster than that of the fastest known architectures. The possibility
of realising the architecture with a varying level of parallelism also gives an
opportunity to trade off time and hardware complexities. It should also be noted that
the architecture is very regular and only requires simple control structures
(essentially, only a clock) unlike, e.g., the architecture of [17]. It does not contain
feedback, switches, or long connections that depend on the size of the input, but only
has connections which are at most $O(L)$ in length. Thus, it can be
implemented as a semisystolic array.

A possible realisation of the Type 1 core DWT architecture 

A possible structure of the $j$th pipeline stage, $j = 1, \ldots, J$, for the Type 1 core DWT
architecture is depicted in Figure 5. Two examples of such a realisation for the case
$L_{max} = 6$, $J = 3$ are shown in Figures 6 and 7, where $p = L_{max} = 6$ and $p = 2$,
respectively. It should be noted that a particular case of this realisation corresponding to
the case $p = L_{max}$ and, in particular, a slightly different version of the example in
Figure 6 has been presented, which is referred to as a limited parallel-pipelined
(LPP) architecture. It should be noted that the Type 1 core DWT architecture and
its realisation in Figure 5 are for the case of arbitrary $p$.

Referring to Figure 5, it can be seen that in this realisation the data routing block
consists of $Q_j$ chain-connected groups of $2^{J-j+1}$ delays each, and a shift register
of length $2^{J-j+1} + L_{max} - 2$ which shifts the values within its cells by $p$ positions
upwards every time unit. The $2^{J-j+1}$ inputs to the stage are connected in parallel to
the first group of delays, the outputs of which are connected to the inputs of the
next group of delays, etc. Outputs of every group of delays are connected in parallel
to $2^{J-j+1}$ consecutive cells of the shift register. The outputs of the last,
$Q_j$th, group of delays are connected to the first $2^{J-j+1}$ cells, outputs of the $(Q_j - 1)$st
group of delays are connected to the next $2^{J-j+1}$ cells, etc., with the exception
that the first $q_j = (L_{max} - 2) - (Q_j - 1)\,2^{J-j+1}$ inputs of the stage are directly connected to
the last $q_j$ cells of the shift register. The outputs of the first $2^{J-j+1} + p - 2$ cells of the
shift register form the output of the data routing block and are connected to the
inputs of the PEs. In the case $p = L_{max}$ (see Figure 6) no shift register is required
but the outputs of the groups of delay elements and the first $q_j$ inputs of the stage
are directly connected to the inputs of the PEs. On the other hand, in the case of
$p = 2$ (see Figure 7) interconnections between the data routing block and the PEs
are simplified since in this case there are only $2^{J-j+1}$ parallel connections from the
first cells of the shift register to the inputs of the PEs. It will be apparent to one of
ordinary skill in the art that the presented realization satisfies the functionality
constraint for the data routing block of the Type 1 core DWT architecture. Indeed, at
the beginning of every step, the shift register contains the concatenation of the
vector of data accepted $Q_j$ steps earlier with the vector of the first $L_{max} - 2$
components from the next accepted vectors. Then during $L_p$ time units it shifts the
components by $p$ positions upwards every time unit.
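The claim that the shift-register realisation satisfies the functional description of the data routing block can be checked with a small Python model. This is a behavioural sketch under simplifying assumptions: the function names are hypothetical, the filter length is taken equal to $L_{max}$, $p$ is assumed to divide $L_{max}$, and the delay groups are abstracted into an already concatenated list `concat`.

```python
def drb_functional(concat, width, p, n):
    """Functional description: components np, ..., (n+1)p + width - 3 at time unit n."""
    return concat[n * p : n * p + width + p - 2]


def drb_shift_register(concat, width, p, L_max):
    """Shift-register realisation: load width + L_max - 2 cells, shift by p per time unit."""
    assert L_max % p == 0, "sketch assumes p divides L_max"
    Lp = L_max // p
    reg = list(concat[: width + L_max - 2])   # contents at the beginning of the step
    outputs = []
    for _ in range(Lp):
        outputs.append(reg[: width + p - 2])  # first width + p - 2 cells feed the PEs
        reg = reg[p:] + [0] * p               # shift upwards by p positions
    return outputs
```

With a stage input width of 4, $L_{max} = 6$ and $p = 2$, the register holds 8 cells and the three time units of a step expose the windows starting at components 0, 2 and 4.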

Possible structures of PEs for the Type 1 core DWT architecture are presented in
Figure 8 for the case of arbitrary $p$, $p = 1$, $p = 2$, and $p = L_{max}$ (Figures 8(a), (b), (c),
and (d), respectively). Again it will be apparent to one of ordinary skill in the art
that these structures implement the operations of (6) and (7) and, thus, satisfy the
functionality description of the PEs. It should again be noted that these structures
are suitable for a generic DWT implementation independent of the filter
coefficients, and PE structures optimized for specific filter coefficients can also be
implemented.



Type 2 core DWT architecture 
The Type 2 core DWT architecture implements a slightly modified version of
Algorithm 3.1. The modification is based on the observation that operands of
operations (6) and (7) are the same for pairs $(i_1, n_1)$ and $(i_2, n_2)$ of indices $i$ and $n$
such that $2i_1 + n_1 p = 2i_2 + n_2 p$. Assuming an even $p$ (the odd case is treated similarly
but requires more notation for its representation), this means that, when
implementing the operations of (6) and (7), the multiplicands required for use in time
unit $n = 1, \ldots, L_p - 1$ within branch $i = 0, \ldots, 2^{J-j} - p/2 - 1$ can be obtained from the
multiplicands obtained at step $n - 1$ within branch $i + p/2$. The corresponding
computational process is described with the following pseudocode, where we denote



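$$l'_k = \begin{cases} l_k & \text{for } k = 0, \ldots, p-1 \\ l_k / l_{k-p} & \text{for } k = p, \ldots, L-1 \end{cases} \qquad
h'_k = \begin{cases} h_k & \text{for } k = 0, \ldots, p-1 \\ h_k / l_{k-p} & \text{for } k = p, \ldots, L-1 \end{cases}$$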
Algorithm 3.2.

1. For $s = 0, \ldots, 2^{m-J}-1$ set $x_{LP}^{(0,s)} = x(s \cdot 2^J : (s+1) \cdot 2^J - 1)$;
2. For $s = s^*(1), \ldots, 2^{m-J} + s^*(J) - 1$
   For $j = j_1, \ldots, j_2$ do in parallel
   Begin

   2.1. Set $\hat{x}^{(j-1,\,s-s^*(j))}$ according to (4)

   2.2. For $i = 0, \ldots, 2^{J-j}-1$ do in parallel

        Begin

          For $k = 0, \ldots, p-1$
          Begin
            set $z_{LP}(i,0,k) = l_k\, \hat{x}^{(j-1,\,s-s^*(j))}(2i + k)$;
            set $z_{HP}(i,0,k) = h_k\, \hat{x}^{(j-1,\,s-s^*(j))}(2i + k)$
          End

          Compute $S_{LP}(i) = \sum_{k=0}^{p-1} z_{LP}(i,0,k)$; $S_{HP}(i) = \sum_{k=0}^{p-1} z_{HP}(i,0,k)$;

          For $n = 1, \ldots, L_p - 1$ do in sequential
          Begin

            For $k = 0, \ldots, p-1$
            Begin
              set $z_{LP}(i,n,k) = l'_{np+k}\, z_{LP}(i + p/2, n-1, k)$ if $i < 2^{J-j} - p/2$,
                  $z_{LP}(i,n,k) = l_{np+k}\, \hat{x}^{(j-1,\,s-s^*(j))}(2i + np + k)$ if $i \geq 2^{J-j} - p/2$;
              set $z_{HP}(i,n,k) = h'_{np+k}\, z_{LP}(i + p/2, n-1, k)$ if $i < 2^{J-j} - p/2$,
                  $z_{HP}(i,n,k) = h_{np+k}\, \hat{x}^{(j-1,\,s-s^*(j))}(2i + np + k)$ if $i \geq 2^{J-j} - p/2$
            End

            Compute $S_{LP}(i) = S_{LP}(i) + \sum_{k=0}^{p-1} z_{LP}(i,n,k)$; $S_{HP}(i) = S_{HP}(i) + \sum_{k=0}^{p-1} z_{HP}(i,n,k)$;

          End

          Set $x_{LP}^{(j,\,s-s^*(j))}(i) = S_{LP}(i)$; $x_{HP}^{(j,\,s-s^*(j))}(i) = S_{HP}(i)$

        End

   End

3. Form the output vector $y$ from the vectors $x_{LP}^{(J,s)}$ and $x_{HP}^{(j,s)}$,
   $j = 1, \ldots, J$, $s = 0, \ldots, 2^{m-J}-1$.
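The reuse of multiplicands between neighbouring branches can be checked with a short Python sketch of Step 2.2 of Algorithm 3.2 for one stage. The function name `stage_with_reuse` and the test data are illustrative assumptions. The sketch further assumes an even $p$, a filter length that is a multiple of $p$, and non-zero low-pass coefficients (so that the ratios $l_k / l_{k-p}$ and $h_k / l_{k-p}$ exist); exact rational arithmetic is used so that the comparison with the direct inner products is not affected by rounding.

```python
from fractions import Fraction


def stage_with_reuse(x_hat, lp, hp, n_pe, p):
    """Step 2.2 of Algorithm 3.2: branch i reuses products of branch i + p/2."""
    L = len(lp)
    assert p % 2 == 0 and L % p == 0 and all(c != 0 for c in lp)
    Lp = L // p
    l = [Fraction(c) for c in lp]
    h = [Fraction(c) for c in hp]
    # modified coefficients l'_k and h'_k
    l_mod = [l[k] if k < p else l[k] / l[k - p] for k in range(L)]
    h_mod = [h[k] if k < p else h[k] / l[k - p] for k in range(L)]
    # time unit n = 0: products formed from the main inputs
    z_lp = [[l[k] * x_hat[2 * i + k] for k in range(p)] for i in range(n_pe)]
    s_lp = [sum(row) for row in z_lp]
    s_hp = [sum(h[k] * x_hat[2 * i + k] for k in range(p)) for i in range(n_pe)]
    for n in range(1, Lp):
        new_z = []
        for i in range(n_pe):
            row_lp = []
            for k in range(p):
                if i < n_pe - p // 2:
                    prev = z_lp[i + p // 2][k]        # from the neighbouring PE
                    zl = l_mod[n * p + k] * prev
                    zh = h_mod[n * p + k] * prev
                else:
                    v = x_hat[2 * i + n * p + k]      # from the data routing block
                    zl = l[n * p + k] * v
                    zh = h[n * p + k] * v
                row_lp.append(zl)
                s_hp[i] += zh
            s_lp[i] += sum(row_lp)
            new_z.append(row_lp)
        z_lp = new_z
    return s_lp, s_hp
```

Only the last $p/2$ branches of a stage read fresh input components after the first time unit; all other branches work from the point-by-point products handed over by their neighbours, which is what allows the Type 2 architecture to drop most of the shift register.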



The general structure of the Type 2 core DWT architecture is presented in Figure
4. In this case connections between PEs (the dashed lines) belonging to the same
pipeline stage are valid. As follows from Figure 4, the Type 2 core DWT architecture
is similar to the Type 1 core DWT architecture, but now, besides its $p$ inputs
and two outputs (later on called main inputs and main outputs), every PE has
additional $p$ inputs and $2p$ outputs (later on called intermediate inputs and outputs).
The $2p$ intermediate outputs of $PE_{j,i+p/2}$ are connected to the $p$ intermediate
inputs of $PE_{j,i}$, $i = 0, \ldots, 2^{J-j} - p/2 - 1$. The other connections within the Type 2 core
DWT architecture are similar to those within the Type 1 core DWT architecture.

The functionality of the blocks of the Type 2 core DWT architecture is substantially
similar to that of the Type 1 core DWT architecture. The functionality of the
data input block is exactly the same as for the case of the Type 1 core DWT
architecture.

The data routing block (of the stage $j = 1, \ldots, J$) can, in general, be realized as an
arbitrary circuitry which at the first time unit $n = 0$ of its every operation step
accepts a vector of $2^{J-j+1}$ components in parallel, and outputs in parallel a vector of
the first $2^{J-j+1} + p - 2$ components $0, 1, \ldots, 2^{J-j+1} + p - 3$ of a vector which is the
concatenation (in chronological order) of the vectors accepted at the previous $Q_j$
steps, where $Q_j$ is defined in (8). Then at every time unit $n = 0, \ldots, L_p - 1$ of that
operation step the data routing block outputs in parallel the next subvector of $p$
components $2^{J-j+1} + np - 2, \ldots, 2^{J-j+1} + (n+1)p - 3$ of the same vector on its last $p$ outputs.
The functionality of the PEs used in the Type 2 core DWT architecture at every
time unit $n = 0, \ldots, L_p - 1$ of every operation step is to compute two inner products of a
vector, say, $x$, either on its $p$ main or $p$ intermediate inputs with two vectors of
predetermined coefficients, say $LP'$ and $HP'$ of length $p$, as well as to compute a
point-by-point product of $x$ with $LP'$. At the time unit $n = 0$ the vector $x$ is formed
using the main $p$ inputs of the PE and at time units $n = 1, \ldots, L_p - 1$ the vector $x$ is
formed using the intermediate inputs of the PE. Results of both inner products
computed during one operation step are accumulated and are passed to the two
main outputs of the PE, while the results of the point-by-point products are passed
to the intermediate outputs of the PE.

Similar to the case of the Type 1 core DWT architecture, it can be seen that the
Type 2 core DWT architecture implements Algorithm 3.2 with time delay and time
period characteristics given by (11) and (12). The other characteristics of the Type
1 and Type 2 architectures are also similar. In particular, the Type 2 architecture
is very fast, may be implemented as a semisystolic architecture and with a varying
level of parallelism, providing an opportunity for creating a trade-off between time
and hardware complexities. A difference between these two architectures is that
the shift registers of data routing blocks of the Type 1 core DWT architecture are
replaced with additional connections between PEs within the Type 2 core DWT
architecture.

A possible realisation of the Type 2 core DWT architecture 
A possible structure of the $j$th pipeline stage, $j = 1, \ldots, J$, for the Type 2 core DWT
architecture is depicted in Figure 9. An example of such a realisation for the case
$L_{max} = 6$, $J = 3$ and $p = 2$ is shown in Figure 10. In this realisation the data routing
block consists of $Q_j$ chain-connected groups of $2^{J-j+1}$ delays each, and a shift
register of length $L_{max} - 2$ which shifts the values upwards by $p$ positions every
time unit. The $2^{J-j+1}$ inputs to the stage are connected in parallel to the first group
of delays, the outputs of which are connected to the inputs of the next group of
delays, etc. The outputs of the last, $Q_j$th, group of delays form the first $2^{J-j+1}$
outputs of the data routing block and are connected to the main inputs of the PEs.
The outputs of the $(Q_j - t)$th group of delays, $t = 1, \ldots, Q_j - 1$, are connected in parallel
to $2^{J-j+1}$ consecutive cells of the shift register. The outputs of the $(Q_j - 1)$st
group of delays are connected to the first $2^{J-j+1}$ cells, the outputs of the $(Q_j - 2)$nd
group of delays are connected to the next $2^{J-j+1}$ cells, etc. However, the first
$q_j = (L_{max} - 2) - (Q_j - 1)\,2^{J-j+1}$ inputs of the stage are directly connected to the last $q_j$
cells of the shift register. The outputs from the first $p - 2$ cells of the shift register
form the last $p - 2$ outputs of the data routing block and are connected to the main
inputs of the PEs according to the connections within the general structure. It can
be shown that the presented realization satisfies the functionality constraint for the
data routing block of the Type 2 core DWT architecture. Indeed, at the beginning
of every step, the first $2^{J-j+1} + p - 2$ components of the vector being the concatenation
(in chronological order) of the vectors accepted at previous $Q_j$ steps are
formed at the outputs of the data routing block and then during every following
time unit the next $p$ components of that vector are formed at its last $p$ outputs.

A possible PE structure for the Type 2 core DWT architecture for the case of $p = 2$
is presented in Figure 10(b). It will be apparent to one of ordinary skill in the art that
structures for arbitrary $p$ and for $p = 1$, $p = 2$, and $p = L_{max}$ can be designed similarly
to those illustrated in Figures 8(a), (b), (c), and (d).

It should be noted that in the case $p = L_{max}$ this realisation of the Type 2 core DWT
architecture is the same as the realisation of the Type 1 core DWT architecture
depicted in Figure 6.

Multi-core DWT architectures 

The two types of core DWT architectures described above may be implemented
with varying levels of parallelism depending on the parameter $p$. Further flexibility
in the level of parallelism is achieved within multi-core DWT architectures by
introducing a new parameter $r = 1, \ldots, 2^{m-J}$. The multi-core DWT architecture is, in fact,
obtained from a corresponding single-core DWT architecture by expanding it $r$
times. Its general structure is presented in Figure 11. The architecture consists of
a data input block and $J$ pipeline stages, each stage containing a data routing
block and a block of PEs.

The data input block may be realized as word-serial or as word-parallel in a way
similar to the case of core DWT architectures, but in this case it has $r2^J$ parallel
outputs, which are connected to the $r2^J$ inputs of the data routing block of the
first pipeline stage. The functionality of the data input block is to accept, serially or
in parallel, and output in parallel a group of components of the input vector at the
rate of $r2^J$ components per operation step.

Consider firstly the Type 1 multi-core DWT architecture. In this case, the $j$th pipeline
stage, $j = 1, \ldots, J$, consists of a data routing block having $r2^{J-j+1}$ inputs
$I_{PS(j)}(0), \ldots, I_{PS(j)}(r2^{J-j+1} - 1)$ forming the input to the stage, and $r2^{J-j+1} + p - 2$ outputs
$O_{DRB(j)}(0), \ldots, O_{DRB(j)}(r2^{J-j+1} + p - 3)$ connected to the inputs of $r2^{J-j}$ PEs. Every PE has
$p$ inputs and two outputs, where $p \leq L_{max}$ is a parameter of the realisation
describing the level of parallelism of every PE. Consecutive $p$ outputs
$O_{DRB(j)}(2i), O_{DRB(j)}(2i+1), \ldots, O_{DRB(j)}(2i+p-1)$ of the data routing block of the $j$th,
$j = 1, \ldots, J$, stage are connected to the $p$ inputs of the $i$th, $i = 0, \ldots, r2^{J-j} - 1$, PE
($PE_{j,i}$) of the same stage. The first outputs of the $r2^{J-j}$ PEs of the $j$th pipeline stage,
$j = 1, \ldots, J-1$, form the outputs $O_{PS(j)}(0), \ldots, O_{PS(j)}(r2^{J-j} - 1)$ of that stage and are
connected to the $r2^{J-j}$ inputs $I_{PS(j+1)}(0), \ldots, I_{PS(j+1)}(r2^{J-j} - 1)$ of the data routing block of
the next, $(j+1)$st, stage. The first outputs of the $r$ PEs of the last, $J$th, stage form the
first $r$ outputs $out(0), \ldots, out(r-1)$ of the architecture. The second outputs of the
$r2^{J-j}$ PEs of the $j$th pipeline stage, $j = 1, \ldots, J$, form the $(r2^{J-j})$th to $(r2^{J-j+1} - 1)$st
outputs $out(r2^{J-j}), \ldots, out(r2^{J-j+1} - 1)$ of the architecture.
The data routing block (of the stage $j = 1, \ldots, J$) can, in general, be realized as an
arbitrary circuitry which at the first time unit $n = 0$ of its every operation step accepts
in parallel a vector of $r2^{J-j+1}$ components, and then at every time unit
$n = 0, \ldots, L_p - 1$ of that operation step outputs in parallel a vector of $r2^{J-j+1} + p - 2$
components $np, np+1, \ldots, (n+1)p + r2^{J-j+1} - 3$ of a vector being the concatenation (in
chronological order) of the vectors accepted at previous $Q_j$ steps (see (8)). The
functionality of the PEs is exactly the same as in the case of the Type 1 core DWT
architecture.

Consider now the Type 2 multi-core DWT architecture. The data input block is
exactly the same as in the case of the Type 1 multi-core DWT architecture. PEs
used in the Type 2 multi-core DWT architecture and interconnections between
them are similar to the case of the Type 2 (single-)core DWT architecture. The
difference is that now there are $r2^{J-j}$ (instead of $2^{J-j}$) PEs within the $j$th pipeline
stage, $j = 1, \ldots, J$, of the architecture. The data routing block now has $r2^{J-j+1}$ inputs and
$r2^{J-j+1} + p - 2$ outputs with similar connections to PEs as in the case of the Type 1
multi-core DWT architecture. The data routing block (of the stage $j = 1, \ldots, J$) can, in
general, be realized as an arbitrary circuitry which at the first time unit $n = 0$ of its
every operation step accepts in parallel a vector of $r2^{J-j+1}$ components, and outputs
in parallel a vector of the first $r2^{J-j+1} + p - 2$ components $0, 1, \ldots, r2^{J-j+1} + p - 3$ of a
vector being the concatenation (in chronological order) of the vectors accepted
at previous $Q_j$ steps. Then at every time unit $n = 0, \ldots, L_p - 1$ of that operation step it
outputs in parallel the next subvector of $p$ components
$r2^{J-j+1} + np - 2, \ldots, r2^{J-j+1} + (n+1)p - 3$ of the same vector on its last $p$ outputs.

Both types of multi-core DWT architecture are $r$ times faster than the single-core
DWT architectures, that is, a linear speed-up with respect to the parameter $r$
is achieved. The delay between input and corresponding output vectors is equal to

$$T_d(C1) = \left(2^{m-J} + \hat{s}(J)\right) \lceil L/p \rceil / r \qquad (14)$$

time units and the throughput or the time period is equal to

$$T_p(C1) = 2^{m-J} \lceil L/p \rceil / r \qquad (15)$$

time units. Thus further speed-up and flexibility for trade-off between time and
hardware complexities is achieved within multi-core DWT architectures. In addition,
the architectures are modular and regular and may be implemented as
semisystolic arrays.
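The linear speed-up expressed by (14) and (15) can be illustrated with the following Python sketch. The function name `multicore_timing` is an illustrative assumption, and the accumulated routing delay of the last stage is again supplied as the parameter `s_hat_J` instead of being derived.

```python
def multicore_timing(m, J, L, p, r, s_hat_J):
    """Delay (14) and time period (15) of a multi-core architecture, in time units."""
    assert 1 <= r <= 2 ** (m - J), "r ranges over 1, ..., 2^(m-J)"
    Lp = -(-L // p)                              # L_p = ceil(L / p)
    steps = 2 ** (m - J)
    delay = (steps + s_hat_J) * Lp / r           # formula (14)
    period = steps * Lp / r                      # formula (15)
    return delay, period
```

Setting `r = 1` reproduces the single-core figures, and each doubling of `r` halves both the delay and the time period, at the cost of `r` times as many PEs per stage.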

As a possible realisation of the multi-core DWT architecture for the case of
$p = L = L_{max}$ and $r = 2^{m-J}$, the DWT flowgraph itself (see Figure 2) may be considered,
where nodes (rectangles) represent PEs and small circles represent latches.

The variable resolution DWT architecture 

The above-described architectures implement DWTs with the number of octaves
not exceeding a given number $J$. They may implement DWTs with fewer than $J$
octaves, though with some loss in hardware utilisation. The variable
resolution DWT architecture implements DWTs with an arbitrary number $J'$ of
octaves, whereas the efficiency of the architecture remains approximately 100%
whenever $J'$ is larger than or equal to a given number $J_{min}$.

The general structure of the variable resolution DWT architecture is shown in
Figure 12(a). It consists of a core DWT architecture corresponding to $J_{min}$
decomposition levels and an arbitrary serial DWT architecture, for instance, an
RPA-based one ([14]-[17], [19]-[20], [22]). The core DWT architecture implements the
first $J_{min}$ octaves of the $J'$-octave DWT. The low-pass results from the out(0) of
the core DWT architecture are passed to the serial DWT architecture. The serial
DWT architecture implements the last $J' - J_{min}$ octaves of the $J'$-octave DWT.
Since the core DWT architecture may be implemented with a varying level of
parallelism, it can be balanced with the serial DWT architecture in such a way that
approximately 100% hardware utilisation is achieved whenever $J' \ge J_{min}$.
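The split of work between the two parts can be illustrated with a purely behavioural software model. It is not a description of the hardware: it only shows that running the first octaves in one block and the remaining octaves on the low-pass output gives the same result as a single multi-octave DWT. The function names, the periodic signal extension and the Haar filters in the example are assumptions made for this sketch.

```python
import math


def octave(x, lo, hi):
    """One DWT octave: filter (periodic extension assumed), then downsample by two."""
    n = len(x)
    low = [sum(lo[k] * x[(2 * i + k) % n] for k in range(len(lo))) for i in range(n // 2)]
    high = [sum(hi[k] * x[(2 * i + k) % n] for k in range(len(hi))) for i in range(n // 2)]
    return low, high


def dwt(x, lo, hi, octaves):
    """Return (final low-pass band, list of high-pass bands, one per octave)."""
    highs = []
    for _ in range(octaves):
        x, h = octave(x, lo, hi)
        highs.append(h)
    return x, highs


def variable_resolution_dwt(x, lo, hi, j_total, j_min):
    """Model of the two-part structure: 'core' part first, 'serial' part afterwards."""
    # Core part: first j_min octaves.
    low, highs_core = dwt(x, lo, hi, j_min)
    # Serial part: remaining j_total - j_min octaves on the low-pass output out(0).
    low, highs_serial = dwt(low, lo, hi, j_total - j_min)
    return low, highs_core + highs_serial


# Example with Haar filters (chosen only for brevity) and a 16-point input.
s = 1 / math.sqrt(2)
haar_lo, haar_hi = [s, s], [s, -s]
signal = [float(v) for v in range(16)]
split = variable_resolution_dwt(signal, haar_lo, haar_hi, 4, 2)
direct = dwt(signal, haar_lo, haar_hi, 4)
print(split == direct)
```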
To achieve the balancing between the two parts, the core DWT architecture must
implement a $J_{min}$-octave $N$-point DWT with the same throughput as, or faster than,
the serial architecture implements a $(J' - J_{min})$-octave $M$-point DWT ($M = N/2^{J_{min}}$).
Serial architectures found in the literature implement an $M$-point DWT either in $2M$
time units ([14], [15]) or in $M$ time units ([14]-[19]), correspondingly employing
either $L$ or $2L$ basic units (BUs, multiplier-adder pairs). They can be scaled down to
contain an arbitrary number $K \le 2L$ of BUs so that an $M$-point DWT would be
implemented in $M\lceil 2L/K \rceil$ time units. Since the (Type 1 or Type 2) core DWT
architecture implements a $J_{min}$-octave $N$-point DWT in $N\lceil L/p \rceil / 2^{J_{min}}$ time units, the
balancing condition becomes $\lceil L/p \rceil \le \lceil 2L/K \rceil$, which will be satisfied if $p = \lceil K/2 \rceil$.
With this condition the variable resolution DWT architecture will consist of a total
number

$A = 2p\left(2^{J_{min}} - 1\right) + K = \begin{cases} K2^{J_{min}}, & \text{if } K \text{ is even} \\ (K+1)2^{J_{min}} - 1, & \text{if } K \text{ is odd} \end{cases}$

of BUs and will implement a $J'$-octave $N$-point DWT in

$T_d = N\lceil 2L/K \rceil / 2^{J_{min}}$

time units.

A variable resolution DWT architecture based on a multi-core DWT architecture
may also be constructed (see Figure 12(b)), where a data routing block is now
inserted between the multi-core and serial DWT architectures. The function of
the data routing block is to parallelly accept and serially output digits at the rate of
$r$ samples per operation step. The balancing condition in this case is $rp = \lceil K/2 \rceil$
and the area-time characteristics are

$A = 2pr\left(2^{J_{min}} - 1\right) + K = \begin{cases} K2^{J_{min}}, & \text{if } K \text{ is even} \\ (K+1)2^{J_{min}} - 1, & \text{if } K \text{ is odd} \end{cases}$

$T_d = N\lceil 2L/K \rceil / 2^{J_{min}}$
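The balancing condition and the resulting area and time figures can be checked numerically. The sketch below assumes the readings $rp = \lceil K/2 \rceil$, $A = 2pr(2^{J_{min}} - 1) + K$ and $T_d = N\lceil 2L/K \rceil / 2^{J_{min}}$, with the core part taking $N\lceil L/p \rceil / (r2^{J_{min}})$ time units. The function name and the returned dictionary keys are hypothetical.

```python
from math import ceil


def variable_resolution_costs(K, L, N, J_min, r=1):
    """Area (in BUs) and time (in time units) of a balanced variable resolution design.

    K: number of BUs in the serial part (K <= 2L), L: filter length,
    N: input length, J_min: octaves handled by the core part, r: number of cores.
    """
    rp = ceil(K / 2)  # balancing condition r*p = ceil(K/2)
    if rp % r != 0:
        raise ValueError("r must divide ceil(K/2) so that p is an integer")
    p = rp // r
    core_time = N * ceil(L / p) / (r * 2 ** J_min)    # core part, first J_min octaves
    serial_time = N * ceil(2 * L / K) / 2 ** J_min    # serial part, remaining octaves
    area = 2 * p * r * (2 ** J_min - 1) + K           # PEs of the core part plus serial BUs
    return {"p": p, "area": area, "core_time": core_time, "time": serial_time}


# Example: L = 9, N = 1024, J_min = 3, for an even and an odd number of serial BUs.
print(variable_resolution_costs(18, 9, 1024, 3))
print(variable_resolution_costs(9, 9, 1024, 3))
```

In both examples the core part is no slower than the serial part, and the area agrees with the closed forms $K2^{J_{min}}$ for even $K$ and $(K+1)2^{J_{min}} - 1$ for odd $K$.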



Table 1 compares the performance of the proposed architectures with that of
some conventional architectures. In this table, as is commonly accepted in the
literature, the area of the architectures is counted as the number of multiplier-adder
pairs used, which are the basic units (BUs) in DWT architectures. The time
unit is counted as the time period of one multiplication since this is the critical pipeline
stage. Characteristics of the DWT architectures proposed according to the invention,
shown in the last seven rows of Table 1, are given both for arbitrary realisation
parameters $L_{max}$, $p$, and $r$ and for some examples of parameter choices. It
should be mentioned that the numbers of BUs used in the proposed architectures
are given assuming the PE examples of Figure 8 (where a PE with $p$ inputs contains
$2p$ BUs). However, PEs could be further optimized to involve fewer BUs.

For convenience, Table 2 presents numerical examples of area-time characteristics
for the choice of the DWT parameters $J = 3$ or $J = 4$, $N = 1024$, and $L = 9$
(which corresponds to the most popular DWT, the Daubechies 9/7 wavelet). Table
3 presents numerical examples for the case $J = 3$ or $J = 4$, $N = 1024$, and $L = 5$ (the
Daubechies 5/3 wavelet). The gate counts presented in these tables have been
found assuming that a BU consists of a 16-bit Booth multiplier followed by a
hierarchical 32-bit adder and thus involves a total of 1914 gates (see [37]). Figure 13
gives a graphical representation of some rows from Table 2. It should be noted
that the line corresponding to the proposed architectures may be continued much
further, though the cases not shown require a rather large silicon area, which
might be impractical at the current state of the technology.
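The conversion from an area expressed in BUs to a gate count is a single multiplication, sketched below under the stated assumption of 1914 gates per BU. The function name is hypothetical, and the count covers only the multiplier-adder pairs, not latches, routing or control. The example value of 144 BUs is an arbitrary illustration and is not taken from the tables.

```python
GATES_PER_BU = 1914  # 16-bit Booth multiplier plus hierarchical 32-bit adder, as assumed in the text


def gate_count(num_bus):
    """Gate count of an architecture containing num_bus basic units (BUs only)."""
    return num_bus * GATES_PER_BU


# Example: an architecture with 144 BUs.
print(gate_count(144))
```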

As follows from these illustrations, the proposed architectures, compared to the
conventional ones, demonstrate excellent time characteristics at moderate area
requirements. Advantages of the proposed architectures are best seen when
considering the performances with respect to the $AT_p^2$ criterion, which is commonly used
to estimate performances of high-speed oriented architectures. Note that the first
row of the tables represents a general purpose DSP architecture. Architectures
presented in the next two rows are either non-pipelined or restricted (only two
stage) pipelined ones and they operate at approximately 100% hardware utilisation,
as is the case for our proposed architectures. So their performance is
"proportional" to the performance of our architectures, which however are much more
flexible in the level of parallelism, resulting in a wide range of time and area
complexities. The fourth row of the tables presents $J$ stage pipelined architectures
with poor hardware utilisation and consequently a poor performance. The fifth to
seventh rows of the tables present architectures from previous publications which
are $J$ stage pipelined and achieve 100% hardware utilisation and good performance
but do not allow a flexible range of area and time complexities as do the
architectures proposed according to the invention.

In the foregoing there has been discussion of general structures of "universal" 
wavelet transformers which are able to implement the wavelet transform with arbi- 
trary parameters such as the filter lengths and coefficients, input length, and the 
number of decomposition levels. Further optimisation of architectures for a specific 
discrete wavelet transform (corresponding to a specific set of above parameters) 
is possible by optimizing the structure of processing elements (PEs) included in 
the architecture. 

The invention can be implemented as a dedicated semisystolic VLSI circuit using 
CMOS technology. This can be either a stand-alone device or an embedded ac- 
celerator to a general-purpose processor. The proposed architectures can be im- 
plemented with varying level of parallelism which leads to varying cost and per- 
formance. The choice of the mode of implementation as well as the desired level 
of parallelism depends on the application field. The architectures as they have 
been described are dedicated to arbitrary DWTs. However, they can be further 
optimized and implemented as dedicated to a specific DWT. This may be desir- 
able in application to, e.g. JPEG 2000 where Daubechies 5/3 or 9/7 wavelets are 
planned to be the basic DWTs. 

Particular implementations and embodiments of the invention have been de- 
scribed. It is clear to a person skilled in the art that the invention is not restricted to 
details of the embodiments presented above, but that it can be implemented in 
other embodiments using equivalent means without deviating from the characteristics
of the invention. The scope of the invention is only restricted by the attached
patent claims. 

Abbreviations

ASIC    Application Specific Integrated Circuits
CMOS    Complementary Metal Oxide Semiconductor
DSP     Digital Signal Processor
DWT     Discrete Wavelet Transform
FPP     Fully Parallel-Pipelined (DWT architecture)
LPP     Limited Parallel-Pipelined (DWT architecture)
PE      Processing Element